{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Further Hypothesis Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "library(tidyverse)\n",
    "\n",
    "set_plot_dimensions <- function(width_choice, height_choice) {\n",
    "    options(repr.plot.width=width_choice, repr.plot.height=height_choice)\n",
    "}\n",
    "\n",
    "cbPal <- c(\"#E69F00\", \"#56B4E9\", \"#009E73\", \"#F0E442\", \"#CC79A7\", \"#0072B2\", \"#D55E00\")\n",
    "\n",
    "set_plot_dimensions(5, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You should see \"Attaching packages\" and some ticks by the packages loaded.\n",
    "# The \"Conflicts\" aren't a problem.\n",
    "\n",
    "# Other problems loading the library? Try running this cell.\n",
    "\n",
    "install.packages('tidyverse')\n",
    "\n",
    "library(tidyverse)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 - Comparing variances of two groups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell to load the data.\n",
    "\n",
    "data <- read_csv(\"stars.csv\")\n",
    "\n",
    "type_key <- c('Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant','Hypergiant')\n",
    "spectral_classes <- c('O','B','A','F','G','K','M')\n",
    "\n",
    "data$type <- factor(data$type)\n",
    "data$spectral_class <- factor(data$spectral_class, levels=spectral_classes)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The t-test only compares the *means* of the two groups. \n",
    "Next, Dr Howe would like you to check their *variances*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question: do types 4 and 5 have the same variance in log(luminosity)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Q-Q plot\n",
    "\n",
    "The [*quantile-quantile plot*](https://en.wikipedia.org/wiki/Q–Q_plot) (Q-Q plot) is a simple, graphical method to check whether two sets of observations appear to come from the same distribution, or to compare one set of data to a theoretical distribution.\n",
    "\n",
    "It is made by plotting the quantiles (i.e. percentiles) of the two distributions against each other.\n",
    "\n",
    "If the variances are the same, the Q-Q plot will approximate a straight line with gradient 1.\n",
    "\n",
    "We can find the percentiles for our sample with the `quantile()` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type4 <- \n",
    "    data %>% \n",
    "    filter(type == 4) %>% \n",
    "    pull(luminosity) %>% \n",
    "    log\n",
    "\n",
    "type5 <-\n",
    "    data %>% \n",
    "    filter(type == 5) %>% \n",
    "    pull(luminosity) %>% \n",
    "    log\n",
    "\n",
    "x <- seq(0, 1, 0.01)\n",
    "t4q <- quantile(type4, x)\n",
    "t5q <- quantile(type5, x)\n",
    "\n",
    "ymin <- 11\n",
    "ymax <- 14\n",
    "\n",
    "set_plot_dimensions(6, 4)\n",
    "par(mfrow=c(1,2))\n",
    "plot(x,t4q,col=cbPal[5],pch=16,ylim=c(ymin,ymax),ylab=\"type 4\")\n",
    "plot(x,t5q,col=cbPal[6],pch=16,ylim=c(ymin,ymax),ylab=\"type 5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting type 5 against type 4:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set_plot_dimensions(5, 5)\n",
    "plot(t4q,t5q,col='grey',pch=16,xlab=\"type 4\",ylab=\"type 5\")\n",
    "\n",
    "# the line with gradient 1, passing through Q50:\n",
    "m <- 1\n",
    "c <- t5q[50] - t4q[50] * m\n",
    "lines(t4q, m * t4q + c, col='black')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks like a pretty good fit, but we can do an [*F-test*](https://en.wikipedia.org/wiki/F-test_of_equality_of_variances) to be more rigorous:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### F-test for equality of variances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Theory\n",
    "\n",
    "Once again, we need a two-tailed test:\n",
    "\n",
    "$H_0$: The two populations have identical variance  $\\sigma^2 = \\sigma_1^2 = \\sigma_2^2$.\n",
    "\n",
    "$H_1$: The two populations have non-identical variances, $\\sigma_1^2 \\ne \\sigma_2^2$.\n",
    "\n",
    "The test statistic is simply the ratio of the sample variances:\n",
    "\n",
    "$$F = \\frac{s_1^2}{s_2^2}$$.\n",
    "\n",
    "Under $H_0$, $F$ follows an [*F-distribution*](https://en.wikipedia.org/wiki/F-distribution) with parameters $(n_1 - 1,n_2 - 1)$.\n",
    "\n",
    "We use this distribution to calculate a p-value for the observed value of the test statistic, $F$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Assumpions\n",
    "\n",
    "- The two samples both follow normal distributions.\n",
    "\n",
    "Note that the means of the two populations may differ.\n",
    "The F-test is highly sensitive to deviations from the assumption of normality.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Application\n",
    "\n",
    "We will set $\\alpha=0.05$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualise the F-distribution corresponding to our example ($n_1 = n_2 = 40$):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x <- seq(0.1,3,0.01)\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x,df(x,39,39), xlab=\"F\", ylab=\"pdf\", type=\"l\", col=\"grey\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will calculate the test statistic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fstat <- var(type4)/var(type5)\n",
    "print(paste(\"F =\",fstat))\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x,df(x,39,39), xlab=\"F\", ylab=\"pdf\", type=\"l\", col=\"grey\")\n",
    "\n",
    "x_region <- seq(0.1,fstat,0.01)\n",
    "polygon(c(x_region,fstat,0.01), c(df(x_region,39,39),0,0), border=NA, col=\"lightgrey\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the CDF to calculate the left-tail $p$-value and double it for a two-tailed test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_value <- pf(fstat,39,39) * 2\n",
    "print(paste(\"p =\",p_value))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, $p>\\alpha$ so we accept the null hypothesis of equal variance, at the 5% level.\n",
    "\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
